DAPO: An Open-Source LLM Reinforcement Learning System at Scale

1 ByteDance Seed   2 Institute for AI Industry Research (AIR), Tsinghua University   3 The University of Hong Kong   4 SIA-Lab of Tsinghua AIR and ByteDance Seed

Full author list in Contributions

Abstract

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique for eliciting complex reasoning. However, key technical details of state-of-the-art reasoning LLMs remain concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework (https://github.com/volcengine/verl), along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.

Date: March 17, 2025
Correspondence: Qiying Yu at yuqy22@mails.tsinghua.edu.cn
Project Page: https://dapo-sia.github.io/

Figure 1  AIME 2024 scores of DAPO on the Qwen2.5-32B base model, outperforming the previous SoTA DeepSeek-R1-Zero-Qwen-32B using 50% of the training steps. (Plot: AIME 2024 accuracy (%) versus training steps, with curves for DAPO avg@32, pass@32, and cons@32.)

1 Introduction

Test-time scaling, as exemplified by OpenAI's o1 [1] and DeepSeek's R1 [2], brings a profound paradigm shift to Large Language Models (LLMs) [3-7]. Test-time scaling enables longer Chain-of-Thought thinking and induces sophisticated reasoning behaviors, making models excel at competition-level math and coding tasks such as AIME and Codeforces. The central technique driving this revolution is large-scale Reinforcement Learning (RL), which elicits complex reasoning behaviors such as self-verification and iterative refinement. However, the actual algorithm and key recipe for scalable RL training remain concealed, hidden from the technical reports of existing reasoning models [1, 2, 8-11]. In this paper, we reveal significant obstacles in large-scale RL training and open-source a scalable RL system, with a fully open-sourced algorithm, training code, and dataset, that delivers democratized, industry-level RL results.

We experiment with Qwen2.5-32B [12] as the pretrained model for RL. In our initial GRPO run, we achieved only 30 points on AIME, a result significantly below DeepSeek's RL (47 points). A thorough analysis reveals that the naive GRPO baseline suffers from several key issues such as entropy collapse, reward noise, and training instability. The broader community has encountered similar challenges in reproducing DeepSeek's results [13-19], suggesting that critical details required to develop an industry-level, large-scale, and reproducible RL system may have been omitted from the R1 paper.
To close this gap, we release an open-source state-of-the-art system for large-scale LLM RL, which achieves 50 points on AIME 2024 with the Qwen2.5-32B base model, outperforming the previous state-of-the-art result of DeepSeek-R1-Zero-Qwen-32B [2] (47 points) using 50% of the training steps (Figure 1). We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and introduce four key techniques that make RL effective in the long-CoT scenario. Details are presented in Section 3.

1. Clip-Higher, which promotes the diversity of the system and avoids entropy collapse;
2. Dynamic Sampling, which improves training efficiency and stability;
3. Token-Level Policy Gradient Loss, which is critical in long-CoT RL scenarios;
4. Overlong Reward Shaping, which reduces reward noise and stabilizes training.

Our implementation is based on verl [20]. By fully releasing our state-of-the-art RL system, including training code and data, we aim to provide valuable insights into large-scale LLM RL that benefit the broader community.

2 Preliminary

2.1 Proximal Policy Optimization (PPO)

PPO [21] introduces a clipped surrogate objective for policy optimization. By constraining policy updates to a proximal region of the previous policy via clipping, PPO stabilizes training and improves sample efficiency. Specifically, PPO updates the policy by maximizing the following objective:

$$
\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\, o_{\le t}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\min\left(\frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q, o_{<t})}\hat{A}_t,\ \operatorname{clip}\left(\frac{\pi_\theta(o_t\mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q, o_{<t})},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t\right)\right]
$$

where $(q, a)$ is a question-answer pair drawn from the data distribution $\mathcal{D}$, $\varepsilon$ is the clipping range of the importance ratio, and $\hat{A}_t$ is an estimator of the advantage at token $t$.
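To make the clipped surrogate concrete, below is a minimal PyTorch-style sketch of the per-token clipped objective, negated into a loss for gradient descent. The function name, tensor shapes, and the response mask argument are illustrative assumptions, not the actual verl implementation.

```python
# Minimal sketch of PPO's clipped surrogate, negated into a loss for a minimizer.
# Assumes per-token log-probabilities and advantages as [batch, seq_len] tensors;
# names and shapes are illustrative, not taken from any specific codebase.
import torch


def ppo_clip_loss(
    logprobs: torch.Tensor,       # log pi_theta(o_t | q, o_<t), shape [B, T]
    old_logprobs: torch.Tensor,   # log pi_theta_old(o_t | q, o_<t), shape [B, T]
    advantages: torch.Tensor,     # advantage estimates A_hat_t (e.g. from GAE), shape [B, T]
    response_mask: torch.Tensor,  # 1 for response tokens, 0 for padding, shape [B, T]
    clip_eps: float = 0.2,        # clipping range epsilon
) -> torch.Tensor:
    # Importance ratio pi_theta / pi_theta_old, computed in log space for stability.
    ratio = torch.exp(logprobs - old_logprobs)

    # Unclipped and clipped surrogate terms; PPO takes the pessimistic minimum.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.minimum(unclipped, clipped)

    # Average over valid (non-padding) tokens and negate, since optimizers minimize.
    return -(surrogate * response_mask).sum() / response_mask.sum().clamp(min=1.0)


# Toy usage: random tensors standing in for a batch of 2 responses of length 8.
B, T = 2, 8
logp = torch.randn(B, T)
old_logp = logp.detach() + 0.01 * torch.randn(B, T)
adv = torch.randn(B, T)
mask = torch.ones(B, T)
print(ppo_clip_loss(logp, old_logp, adv, mask))
```

How the clipping range is set and how the per-token terms are aggregated both matter in practice; DAPO revisits exactly these choices with Clip-Higher and the Token-Level Policy Gradient Loss in Section 3.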